1 Categorization and supervised machine learning: Title extraction from bodies of HTML
documents and its application to web page retrieval 

Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li 
August 2005 Proceedings of the 28th annual international ACM SIGIR conference on 
Research and development in information retrieval SIGIR '05 

Full text available: pdf(347.22 KB) Additional Information: full citation, abstract, references, index terms

This paper is concerned with automatic extraction of titles from the bodies of HTML 
documents. Titles of HTML documents should be correctly defined in the title fields; 
however, in reality HTML titles are often bogus. It is desirable to conduct automatic 
extraction of titles from the bodies of HTML documents. This is an issue which does not 
seem to have been investigated previously. In this paper, we take a supervised machine 
learning approach to address the problem. We propose a specification 0 ... 



Keywords: HTML document, information retrieval, metadata extraction 



2 Categorization and classification: An application of text categorization methods to gene
ontology annotation
Kazuhiro Seki, Javed Mostafa 

August 2005 Proceedings of the 28th annual international ACM SIGIR conference on 
Research and development in information retrieval SIGIR '05 

Full text available: pdf(249.13 KB) Additional Information: full citation, abstract, references, index terms

This paper describes an application of IR and text categorization methods to a highly 
practical problem in biomedicine, specifically, Gene Ontology (GO) annotation. GO 
annotation is a major activity in most model organism database projects and annotates 
gene functions using a controlled vocabulary. As a first step toward automatic GO 
annotation, we aim to assign GO domain codes given a specific gene and an article in which 
the gene appears, which is one of the task challenges at the TREC 2004 Ge ... 

Keywords: automatic database curation, genomic IR, text categorization 



3 Digital libraries and cyberinfrastructure track: use of digital libraries in education:
Comprehensive personalized information access in an educational digital library 
Peter Brusilovsky, Rosta Farzan, Jae-wook Ahn 

June 2005 Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries 

This paper explores two ways to help students locate most relevant resources in educational
digital libraries. One method gives a more comprehensive access to educational resources, 
through multiple pathways of information access, including browsing and information 
visualization. The second method is to access personalized information through social 
navigation support. This paper presents the details of the Knowledge Sea II system for 
comprehensive personalized access to educational resources an ... 



Keywords: classroom study, information map, social navigation 



4 Tools & techniques track: supporting classification: Automatic extraction of titles from
general documents using machine learning

Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Qinghua Zheng 

June 2005 Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries 

Full text available: pdf(371.18 KB) Additional Information: full citation, abstract, references, index terms

In this paper, we propose a machine learning approach to title extraction from general 
documents. By general documents, we mean documents that can belong to any one of a 
number of specific genres, including presentations, book chapters, technical papers, 
brochures, reports, and letters. Previously, methods have been proposed mainly for title 
extraction from research papers. It has not been clear whether it could be possible to 
conduct automatic title extraction from general documents. As a case ... 

Keywords: information extraction, machine learning, metadata extraction, search 



5 Digital libraries and cyberinfrastructure track: use of digital libraries in the humanities:
Annotating illuminated manuscripts: an effective tool for research and education
Maristella Agosti, Nicola Ferro, Nicola Orio

June 2005 Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

Full text available: pdf(1.10 MB) Additional Information: full citation, abstract, references, index terms

The aim of this paper is to report the research results of an ongoing project that deals with 
the exploitation of a digital archive of drawings and illustrations of historic documents for 
research and educational purposes. According to the results on a study of user 
requirements, we have designed tools to provide researchers with innovative ways for 
accessing the digital manuscripts, sharing, and transferring knowledge in a collaborative 
environment. We have found that the results of scientific ... 

Keywords: annotation, digital images, education environment, user requirements 



6 Access control for XML document: Generalized XML security views 
Gabriel Kuper, Fabio Massacci, Nataliya Rassadko 

June 2005 Proceedings of the tenth ACM symposium on Access control models and 
technologies 

Full text available: pdf(168.30 KB) Additional Information: full citation, abstract, references, index terms

We investigate a generalization of the notion of XML security view introduced by Stoica and 
Farkas [17] and later refined by Fan et al. [8]. The model consists of access control policies 
specified over DTDs with XPath expression for data-dependent access control policies. We 
provide the notion of security views for characterizing information accessible to authorized 
users. This is a transformed (sanitized) DTD schema that can be used by users for query 
formulation and optimization. Then w ... 

7 Engineering design: Platform-independent accessibility API: accessible document
object model 

Andres Gonzalez, Loretta Guarino Reid 

May 2005 Proceedings of the 2005 International Cross-Disciplinary Workshop on Web 
Accessibility (W4A) W4A '05 

Full text available: pdf(167.73 KB) Additional Information: full citation, abstract, references, index terms

This paper addresses the problem of supporting accessibility in applications that run in 
multiple operating environments. It analyzes the commonalities of existing platform-specific 
Accessibility APIs, and defines a platform-independent accessibility API, the Accessible 
DOM. The Accessible DOM encompasses the features of existing APIs and overcomes the 
limitations of existing APIs to express dynamic, complex document contents. The Accessible
DOM can be used to support existing and future platform- ... 

Keywords: W3C DOM, accessibility API 



8 Schemas and semantics: WEESA: Web engineering for semantic Web applications 
Gerald Reif, Harald Gall, Mehdi Jazayeri 

May 2005 Proceedings of the 14th international conference on World Wide Web 

Full text available: pdf(203.78 KB) Additional Information: full citation, abstract, references, index terms

The success of the Semantic Web crucially depends on the existence of Web pages that 
provide machine-understandable meta-data. This meta-data is typically added in the 
semantic annotation process which is currently not part of the Web engineering process. 
Web engineering, however, proposes methodologies to design, implement and maintain 
Web applications but lacks the generation of meta-data. In this paper we introduce a
technique to extend existing Web engineering methodologies to develop semanti ... 

Keywords: Web engineering, ontology, semantic Web, semantic annotation 



9 Web engineering with semantic annotation: Improving portlet interoperability through 
deep annotation 

Oscar Diaz, Jon Iturrioz, Arantza Irastorza 

May 2005 Proceedings of the 14th international conference on World Wide Web 

Full text available: pdf(476.49 KB) Additional Information: full citation, abstract, references, index terms

Portlets (i.e. multi-step, user-facing applications to be syndicated within a portal) are 
currently supported by most portal frameworks. However, there is not yet a definitive 
answer to portlet interoperation whereby data flows smoothly from one portlet to a 
neighbouring one. Both data-based and API-based approaches exhibit some drawbacks in 
either the limitation of the sharing scope or the standardization effort required. We argue 
that these limitations can be overcome by using deep annotation ... 

Keywords: data-flow, deep-annotation, event, portal ontology, portlet interoperability 



10 Text analysis and extraction: Gimme' the context: context-driven automatic semantic
annotation with C-PANKOW

Philipp Cimiano, Gunter Ladwig, Steffen Staab 

May 2005 Proceedings of the 14th international conference on World Wide Web 

Full text available: pdf(295.06 KB) Additional Information: full citation, abstract, references, index terms

Without the proliferation of formal semantic annotations, the Semantic Web is certainly 
doomed to failure. In earlier work we presented a new paradigm to avoid this: the 'Self 
Annotating Web', in which globally available knowledge is used to annotate resources such 
as web pages. In particular, we presented a concrete method instantiating this paradigm, 
called PANKOW (Pattern-based ANnotation through Knowledge On the Web). In PANKOW, a 
named entity to be annotated is put into several linguistic p ... 

Keywords: annotation, information extraction, metadata, semantic Web 



11 Applications: Web-assisted annotation, semantic indexing and search of television and 
radio news 

Mike Dowman, Valentin Tablan, Hamish Cunningham, Borislav Popov 

May 2005 Proceedings of the 14th international conference on World Wide Web 

Full text available: pdf(403.97 KB) Additional Information: full citation, abstract, references, index terms

The Rich News system, that can automatically annotate radio and television news with the 
aid of resources retrieved from the World Wide Web, is described. Automatic speech 
recognition gives a temporally precise but conceptually inaccurate annotation model. 
Information extraction from related web news sites gives the opposite: conceptual accuracy 
but no temporal data. Our approach combines the two for temporally accurate conceptual 
semantic annotation of broadcast news. First low quality transcri ... 

Keywords: Web search, automatic speech recognition, key-phrase extraction, media 
archiving, multimedia, natural language processing, semantic Web, semantic annotation, 
topical segmentation 



12 Late breaking results: posters: Annotating 3D electronic books 
Lichan Hong, Ed H. Chi, Stuart K. Card 

April 2005 CHI '05 extended abstracts on Human factors in computing systems 

Full text available: pdf(658.17 KB) Additional Information: full citation, abstract, references, index terms

The importance of annotations, as a by-product of the reading activity, cannot be 
overstated. Annotations help users in the process of analyzing, re-reading, and recalling 
detailed facts such as prior analyses and relations to other works. As electronic reading
becomes pervasive, digital annotations will become part of the essential records of the
reading activity. But creating and rendering annotations on a 3D book and other objects in 
a 3D workspace is non-trivial. In this paper, we present ou ... 

Keywords: 3D book, annotation, digital library, user interface 



13 Information access and retrieval (IAR): Retrieving lightly annotated images using
image similarities

Masashi Inoue, Naonori Ueda 

March 2005 Proceedings of the 2005 ACM symposium on Applied computing 

Full text available: pdf(207.74 KB) Additional Information: full citation, abstract, references, index terms

Users' search needs are often represented by words, and images are retrieved according to
such textual queries. Annotation words assigned to the stored images are most useful to
connect queries to the images. However, due to annotation cost, only a limited number of
annotation words are available in many cases. When annotations are not given at all,
techniques are needed that assign annotations automatically. When only a few
annotation words are given to each image (lightly annotated), ... 

Keywords: annotation-based retrieval, image retrieval, lightly annotated images, word
associations 



14 Web technologies and applications (WTA): Survey of semantic annotation platforms 
Lawrence Reeve, Hyoil Han 

March 2005 Proceedings of the 2005 ACM symposium on Applied computing 

Full text available: pdf(74.31 KB) Additional Information: full citation, abstract, references, index terms

The realization of the Semantic Web requires the widespread availability of semantic 
annotations for existing and new documents on the Web. Semantic annotations tag
ontology class instance data and map it into ontology classes. The fully automatic creation 
of semantic annotations is an unsolved problem. Instead, current systems focus on the 
semi-automatic creation of annotations. The Semantic Web also requires facilities for the 
storage of annotations and ontologies, user interfaces, acce ... 

Keywords: information extraction, semantic annotation, semantic web 



15 Long papers: knowledge acquisition and knowledge-based design: ClaimSpotter: an
environment to support sensemaking with knowledge triples 
Bertrand Sereno, Simon Buckingham Shum, Enrico Motta 

January 2005 Proceedings of the 10th international conference on Intelligent user 
interfaces 

Full text available: pdf(1.46 MB) Additional Information: full citation, abstract, references, index terms

Annotating a document with an interpretation of its contents raises a number of challenges 
that we are hoping to address via the creation of a supporting environment. We present 
these challenges and motivate an approach based on the notion of suggestions to support 
document annotation, hoping these suggestions would act as leads to follow for annotators, 
therefore reducing some of the difficulties inherent to the task. The environment resulting 
from this approach, ClaimSpotter, is presented. Asp ... 

Keywords: annotation, interface, sensemaking, user studies 



16 KM-4 (knowledge management): distributed knowledge management: Towards 
smarter documents 

Vikas Krishna, Prasad M. Deshpande, Savitha Srinivasan 

November 2004 Proceedings of the thirteenth ACM conference on Information and 
knowledge management 

Full text available: pdf(224.70 KB) Additional Information: full citation, abstract, references, index terms

Document analysis research typically focuses on document image understanding or classic 
problems in text classification, clustering, summarization and discovery. While that is an 
important aspect of document management, in practice, document lifecycles are often
determined by the context of the business process that they are relevant to. It therefore 
becomes necessary for the document analysis techniques to recognize and leverage the 
contextual information provided by a supporting schema and ... 

Keywords: classification, content, processes, workflow 



17 XML processing: A comprehensive solution to the XML-to-relational mapping problem 
Sihem Amer-Yahia, Fang Du, Juliana Freire 

November 2004 Proceedings of the 6th annual ACM international workshop on Web
information and data management 

Full text available: pdf(114.33 KB) Additional Information: full citation, abstract, references, index terms

The use of relational database management systems (RDBMSs) to store and query XML 
data has attracted considerable interest with a view to leveraging their powerful and 
reliable data management services. Due to the mismatch between the relational and XML 
data models, it is necessary to first shred and load the XML data into relational tables, and 
then translate XML queries over the original data into equivalent SQL queries over the
mapped tables. Although there is a rich literature on XML-rel ... 

Keywords: XML shredding, XML storage, mapping techniques, relational databases 



18 Information sharing and access: Asynchronous collaborative writing through 
annotations 

Chunhua Weng, John H. Gennari 

November 2004 Proceedings of the 2004 ACM conference on Computer supported 
cooperative work 

Full text available: pdf(260.11 KB) Additional Information: full citation, abstract, references, index terms

Annotation is central to iterative reviewing and revising activities in asynchronous 
collaborative writing. Currently most digital annotation models and systems assume static 
context information and provide far less functionality than physical annotations. We extend 
prior annotation research by Marshall and Cadiz and design an activity-oriented annotation 
model to mimic the rich functionality of physical annotations for an enhanced collaborative 
writing process. In this model, we define an an ... 

Keywords: annotation, asynchronous collaboration, collaborative writing 



19 Document management: Accommodating paper in document databases 
Majed AbuSafiya, Subhasish Mazumdar 

October 2004 Proceedings of the 2004 ACM symposium on Document engineering 

Full text available: pdf(234.77 KB) Additional Information: full citation, abstract, references, index terms

Although the paperless office has been imminent for decades, documents in paper form 
continue to be used extensively in almost all organizations. Present-day information 
systems are designed on the premise that any paper document in use will be either 
converted into electronic form or merely printed from electronic file(s) accessible to the 
system. Yet, paper is the medium of choice in many situations, mainly owing to its 
portability and usability, and the medium of necessity in others, espec ... 

Keywords: RFID, document databases, document management, enterprise document 
model, paper documents, paper manifestation 



20 Document management: The lifecycle of a digital historical document: structure and 
content 

A. Antonacopoulos, D. Karatzas, H. Krawczyk, B. Wiszniewski 

October 2004 Proceedings of the 2004 ACM symposium on Document engineering 

Full text available: pdf(313.24 KB) Additional Information: full citation, abstract, references, index terms

This paper describes the lifecycle of a digital historical document, from template-based 
structure definition through to content extraction from the scanned pages and its final 
reconstitution as an electronic document (combining content and semantic information) 
along with the tools that have been created to realise each stage in the lifecycle. The whole 
approach is described in the context of different types of typewritten documents relating to
prisoners in World-War II concentration camps a ... 

Keywords: digital libraries, document analysis, document architecture, document 
engineering, historical documents, text enhancement 